Summary

The main focus of our project was to address issues related to fault-tolerant guarantees of message deadlines
in communication networks, i.e., no matter what happens (even in the presence of a network fault), the
messages will be transmitted before their deadlines. We focused on FDDI (Fiber Distributed Data Interface)
networks for this study because they are well suited for hard real-time communications, due not only to their
high bandwidth, but also to their property of bounded token rotation time and to their dual ring architecture.
We also extended our results to ATM (Asynchronous Transfer Mode) networks. To the best of our knowledge,
no previous work in this area has been reported. The project was carried out in four tasks: 1) Development of
delay guarantee methods for non-faulty situations; 2) Development of fault management methods; 3)
Development of delay guarantee methods for faulty situations; 4) Extension to ATM networks. We
demonstrated that, with the technology we developed, the project objective was successfully achieved. In
particular, one of the bandwidth allocation algorithms we designed and analyzed has been officially adopted by the DoD
SAFENET.


1.  Overview of the Project

High-speed  networks  are  being  increasingly  deployed  in  distributed  computer  systems.  These  networks  may  have 
stringent real-time and fault-tolerance requirements. Besides tolerating an occasional fault caused by component
wear-out, these networks must also recover from multiple faults caused by non-natural forces. Consider, for example, a
network employed on a battleship to support real-time communication between various on-board devices. An enemy
attack may break several communication links in this network. The dual requirements of providing real-time
communication and fault tolerance mean that such networks must guarantee the delivery of critical messages on
time even in some faulty situations. Designing such networks is a challenging task. The main focus of our project
was  thus  to  address  issues  related  to  fault-tolerant  guarantees  of  synchronous  message  deadlines  in  communication 
networks,  i.e.,  no  matter  what  happens  (even  in  the  presence  of  a  network  fault),  the  messages  will  be  transmitted 
before  their  deadlines. 

We  have  focused  on  FDDI  (Fiber  Distributed  Data  Interface)  networks  for  this  study  because  they  are  well  suited  for 
hard  real-time  communications,  due  not  only  to  their  high  bandwidth,  but  also  to  their  property  of  bounded  token 
rotation  time  and  to  their  dual  ring  architecture.  A  bounded  token  rotation  time  provides  a  necessary  condition  to 
guarantee  hard  real-time  deadlines  while  the  dual  ring  architecture  allows  the  maintenance  of  a  continuous  real-time 
service  under  some  failure  conditions.  Although  indispensable,  the  bounded  token  rotation  time  and  the  dual  ring 
architecture  alone  are  inadequate  for  guaranteeing  message  deadlines.  In  addition  to  these  features,  synchronous 
bandwidth  allocation  also  plays  a  key  role  in  the  timely  delivery  of  synchronous  messages.  Furthermore,  under  the 
proposed  standard,  upon  the  occurrence  of  a  fault,  detection  and  recovery  processes  may  take  several  seconds  to 
complete. This is too long to satisfy message deadlines in many hard real-time applications. In this project, we
developed techniques to rectify this situation. We also extended our results to ATM (Asynchronous Transfer Mode) networks.

The project was carried out in four tasks: 1) Development of delay guarantee methods for non-faulty situations; 2)
Development of fault management methods; 3) Development of delay guarantee methods for faulty situations; 4)
Extension to ATM networks. The key contributions that made our work innovative and unique are as follows:

• We developed techniques that are able to guarantee message deadlines at all times. With our approach,
the message deadlines are met before the fault occurs, during fault detection and re-configuration, and
after re-configuration. To the best of our knowledge, no previous work in this area has been reported.

•  Our  approach  is  compatible  with  the  FDDI  and  SAFENET  standards.  Hence,  the  results  obtained 
from  our  work  are  immediately  applicable  to  the  design  and  analysis  of  distributed  hard  real-time 
systems  where  a  SAFENET  network  is  used. 

• Our work has contributed significantly to the state of the art in the theory of hard real-time scheduling
and communication. We analyzed the system by deriving its worst case utilization bound. This metric
is particularly important because it indicates the safety margin of the system and provides a measure of
system stability. All previous work regarding this measure is related to the rate monotonic scheduling
algorithm. Our work was the first to derive the worst case utilization bound for a scheduling
environment where global priority arbitration is not supported and hence the rate monotonic algorithm
cannot be used.

•  For  ATM  networks,  we  also  developed  deadline  guarantee  methods  which  are  compatible  with  the 
current  products  and  standards.  Hence,  we  were  able  to  be  the  first  to  develop  a  toolkit  (NetEx)  that  can 
provide  end-to-end  deadline  guarantees  in  ATM  networks  that  are  currently  commercially  available. 


2.  Accomplished  Tasks 


During this project, the P.I. and his graduate students successfully accomplished the four research tasks outlined in
the proposal. The major achievements of these tasks are summarized below.

2.1  Delay  Guarantees  in  FDDI  Networks 

Our  method  for  delay  guarantees  over  FDDI  networks  is  based  on  synchronous  bandwidth  allocation.  Synchronous 
bandwidth  allocation  schemes  may  be  divided  into  two  classes:  local  allocation  schemes  and  global  allocation 
schemes. These schemes differ in the type of information they may use. A local synchronous bandwidth allocation
scheme can only use information available locally to node $i$ in allocating its synchronous bandwidth $H_i$. Locally
available information at node $i$ includes the parameters of stream $S_i$ (i.e., its message transmission time $C_i$,
period $P_i$, and relative deadline $D_i$) and the TTRT (Target Token Rotation Time). On the other
hand,  a  global  synchronous  bandwidth  allocation  scheme  can  use  global  information  in  its  allocation  of  synchronous 
bandwidth  to  a  node.  Global  information  includes  both  local  information  and  information  regarding  the  parameters 
of  synchronous  message  streams  originating  at  other  nodes. 

A local scheme is preferable from a network management perspective. If the parameters of stream $S_i$ change, then only
the synchronous bandwidth $H_i$ of node $i$ needs to be recalculated. The synchronous bandwidths at other nodes do
not  need  to  be  changed  because  they  were  calculated  independently.  This  makes  a  local  scheme  flexible  and  suited 
for  use  in  dynamic  environments  where  synchronous  message  streams  are  dynamically  initiated  or  terminated. 

In a global scheme, if the parameters of $S_i$ change, it may be necessary to re-compute the synchronous bandwidths for
all nodes. Therefore a global scheme is not well suited for a dynamic environment. On the other hand, the extra
information employed by a global scheme may allow it to handle more traffic than a local scheme. However, it is
known that local schemes can perform very close to the optimal synchronous bandwidth allocation scheme when
message deadlines are equal to message periods. Consequently, given the previously demonstrated good
performance of local schemes and their desirable network management properties, we concentrate on local
synchronous bandwidth allocation schemes here.

We  have  proposed  and  analyzed  several  local  and  global  synchronous  bandwidth  allocation  schemes.  Here  we 
introduce  the  one  that  was  officially  adopted  by  SAFENET.  With  this  scheme,  the  synchronous  bandwidth  for  node 
$i$ is allocated according to the following formula:

$$H_i = \frac{U_i D_i}{\lfloor D_i / TTRT \rfloor - 1} \qquad (1)$$

where $U_i = C_i / P_i$ is the utilization of stream $S_i$. Intuitively, this scheme follows the flow conservation
principle. Between the arrival of a message and its absolute deadline, which is $D_i$ time later, node $i$ will have
at least $(\lfloor D_i / TTRT \rfloor - 1) H_i$ of transmission
time available for synchronous messages. This transmission time is available regardless of the number of
asynchronous messages in the network. During this $D_i$ time, $U_i D_i$ can loosely be regarded as the load on node $i$.
Thus  the  synchronous  bandwidth  in  (1)  is  just  sufficient  to  handle  the  load  on  node  i  between  the  arrival  of  a 
message  and  its  deadline.  An  advantage  of  using  this  method  is  that  it  comes  with  a  simple  deadline  guarantee 
testing  procedure:  as  long  as  the  summation  of  bandwidths  allocated  to  individual  nodes  is  no  more  than  TTRT,  all 
the  messages  will  meet  their  deadlines. 
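
The allocation rule and its deadline guarantee test can be sketched in a few lines of Python. This is only an illustration: it assumes that formula (1) has the form H_i = U_i D_i / (floor(D_i / TTRT) - 1) with U_i = C_i / P_i, which is how the flow conservation argument above reads, and the function names and sample values are hypothetical.

```python
import math


def local_allocation(c, p, d, ttrt):
    """Synchronous bandwidth H_i for one stream (assumed form of formula (1)).

    c: message transmission time, p: period, d: relative deadline,
    ttrt: target token rotation time (all in the same time unit).
    """
    # Number of token visits assumed to be guaranteed within a window of length d.
    visits = math.floor(d / ttrt) - 1
    if visits < 1:
        raise ValueError("deadline must be at least 2 * TTRT")
    # H_i = U_i * D_i / visits, with U_i = c / p.
    return c * d / (p * visits)


def deadlines_guaranteed(streams, ttrt):
    """Test stated in the text: total allocated bandwidth must not exceed TTRT."""
    total = sum(local_allocation(c, p, d, ttrt) for c, p, d in streams)
    return total <= ttrt


# Two streams given as (C_i, P_i, D_i), with TTRT = 8 time units.
streams = [(2, 40, 40), (3, 60, 60)]
print(deadlines_guaranteed(streams, ttrt=8))
```

Because each H_i depends only on the parameters of its own stream and on TTRT, changing one stream only requires recomputing one term of the sum, which is the management advantage of a local scheme described earlier.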


2.2  Fault  Management  in  FDDI  Networks 

As we mentioned earlier, there were some serious problems in the fault-tolerance capability of the SAFENET/FDDI
architecture. For example, two trunk link faults may disconnect the network, but many mission-critical applications
must  survive  under  multiple  fault  situations.  In  this  part  of  the  project,  we  aimed  at  significantly  improving  the 
SAFENET/FDDI  architecture  and  hence  providing  efficient  and  effective  fault-tolerant  real-time  capabilities. 

The  basic  idea  is  to  use  multiple  FDDI  trunk  rings.  The  space  redundancy  provided  by  a  multi-ring  architecture 
should  improve  the  system  performance.  However,  the  key  to  fully  realizing  the  potential  of  this  new  architecture  is 
to develop good reconfiguration algorithms to handle faults. We developed an architecture called FBRN
(FDDI-Based Reconfigurable Network) that consists of $r$ FDDI rings. An FBRN achieves a high degree of fault tolerance
by 1) using multiple FDDI ring networks to connect the hosts and 2) using efficient fault detection and network
reconfiguration algorithms.

Fault  detection  within  FBRN  is  a  decentralized  process  with  each  FDDI  ring  performing  its  standard  fault  detection 
operations.  Each  FDDI  trunk  ring  has  its  fault  recovery  mechanism  which  enables  the  ring  to  recover  from  node 
faults  using  the  bypass  switch  and  from  single  point  trunk  faults  using  the  wrap  up  operation.  FBRN  also  has  a 
network-level  fault  recovery  mechanism  which  monitors  the  available  bandwidth  in  the  network  and  invokes  a 
reconfiguration  algorithm  when  the  lower  level  FDDI-based  recovery  methods  are  inadequate.  The  problem  of 
designing  efficient  reconfiguration  algorithms  for  FBRN  has  been  extensively  addressed  [???].  The  reconfiguration 
algorithm  is  optimal  in  the  sense  that  it  always  produces  a  configuration  that  has  the  largest  number  of  functional 
rings  for  the  given  fault  pattern.  One  disadvantage  of  the  optimal  algorithm  is  that  it  requires  global  information 
regarding  the  system  fault  status.  In  real-time  systems,  the  communication  overheads  incurred  in  collecting  this 
information may be intolerable. The local reconfiguration algorithm described in [42] does not suffer from this problem.
The local algorithm operates in a fully distributed fashion utilizing only locally available information at each node.
Further, with this algorithm, the reconfiguration process is transparent to the fault-free rings; the ongoing traffic on
these rings is unaffected by the reconfiguration process. Although the local algorithm is not optimal, it is
demonstrated to have near-optimal performance [38]. The performance data showed that with our algorithms, the
fault-tolerance capability of the system was greatly improved. For example, with 25% of the links faulty, the network can,
on average, still provide high bandwidth compared with a network that does not use our methods. Hence, we
recommend employing the local reconfiguration algorithm in practical systems.


2.3  Fault-Tolerant  Deadline  Guarantees  in  FDDI  Networks 

The objective of this part of the project is to provide fault-tolerant real-time management in FBRN, that is, to
manipulate the network resources and messages so that deadline-guaranteed communication can be provided at
all times. We divide the functionalities of this management into two parts, viz., off-line management and on-line
management.  Off-line  management  involves  message  assignments,  bandwidth  allocation  and  verification  that  the 
fault-tolerant  real-time  requirements  can  be  met.  On-line  management  is  responsible  for  detecting  faults,  reconfiguring 
the  network  in  the  event  of  faults,  migrating  messages  from  faulty  rings  to  non-faulty  ones  if  necessary,  and  so  on. 

For  our  solution  to  be  practical,  we  have  to  minimize  the  overheads  involved  in  on-line  management.  Imagine  a  set 
of messages already assigned to different rings. At run-time some ring may be detected as faulty. It is the task of
on-line management to decide what needs to be done about the messages that were being transmitted on that ring. One
approach  is  to  dynamically  revise  all  the  message-to-ring  assignments  whenever  a  ring  fault  is  detected.  With  this 
approach,  one  may  be  able  to  fully  utilize  the  network  resources  while  attempting  to  meet  the  message  requirements. 
Clearly,  this  method  involves  a  large  run-time  overhead  and  is  not  practical  for  real-time  applications. 

We  adopt  a  group-based  management  approach  which  deals  with  message  groups  rather  than  individual  messages. 
This approach greatly reduces overheads associated with on-line management. In the group-based approach, messages
are  grouped  together  based  on  certain  criteria.  All  messages  belonging  to  a  group  are  assigned  to  a  single  ring.  With 
the  group-based  approach,  off-line  management  is  responsible  for  message  grouping,  bandwidth  allocation  and 
schedulability  verification  while  on-line  management  is  responsible  for  network  initialization,  fault  detection, 
network  reconfiguration,  message  group  migration  and  re-initialization. 
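
To make the group-based idea concrete, the following Python sketch moves whole message groups off a faulty ring. It is an illustration only and not the migration procedure of this project: the first-fit placement rule, the function name, and the data layout (dictionaries keyed by hypothetical group and ring names) are all assumptions.

```python
def migrate_groups(assignment, demand, capacity, faulty_ring):
    """Move every message group off `faulty_ring` onto a surviving ring.

    assignment: dict group -> ring it is currently assigned to
    demand:     dict group -> synchronous bandwidth needed by the whole group
    capacity:   dict ring  -> synchronous bandwidth budget of the ring
    Returns a new assignment, or None if some group cannot be placed.
    """
    # Current load on each surviving ring.
    load = {ring: 0.0 for ring in capacity if ring != faulty_ring}
    for group, ring in assignment.items():
        if ring != faulty_ring:
            load[ring] += demand[group]

    new_assignment = dict(assignment)
    for group, ring in assignment.items():
        if ring != faulty_ring:
            continue
        # First fit (an assumed policy): first surviving ring with enough spare bandwidth.
        target = next(
            (r for r in load if load[r] + demand[group] <= capacity[r]), None
        )
        if target is None:
            return None
        load[target] += demand[group]
        new_assignment[group] = target
    return new_assignment


assignment = {"g1": "A", "g2": "B", "g3": "A"}
demand = {"g1": 3, "g2": 4, "g3": 2}
capacity = {"A": 8, "B": 8, "C": 8}
print(migrate_groups(assignment, demand, capacity, faulty_ring="A"))
```

The point of the sketch is that the run-time decision is made once per group rather than once per message, which is where the reduction in on-line overhead comes from.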

We observed that the performance of the system critically depends on the message grouping strategies used in off-line
management. We considered three approaches: pure temporal redundancy, pure spatial redundancy, and an integrated
approach. The pure temporal redundancy approach uses only one copy of each message but allocates sufficient bandwidth
to allow additional time for the message to migrate from one ring to another in the event of a fault. The pure spatial
redundancy approach uses multiple copies of the message and assigns them to different rings to achieve fault
tolerance. The temporal redundancy approach may require less total bandwidth than the spatial
redundancy approach but may sometimes be infeasible due to excessive message migration overheads. The spatial
redundancy approach obviates the need for migration but may be rendered infeasible by excessive bandwidth
requirements. Our third approach is an integrated approach that combines both spatial and temporal redundancies. In
this approach we transform each message into multiple copies, with at most one copy providing temporal
redundancy through allowance for migration. The remaining copies need not migrate and provide spatial redundancy.
The integrated approach strikes a balance between the bandwidth requirements and migration overheads of the pure spatial
and pure temporal redundancy approaches and hence outperforms both of them.

In summary, our solution for providing fault-tolerant real-time communication fully exploits the established results
on reconfigurability of FDDI-based networks and the real-time capabilities of FDDI.

It is practical because it is modular, easily scalable, and implementable. The network architecture itself is
composed of standard FDDI components. The on-line management tasks are implementable in a distributed
fashion. The bandwidth allocation scheme is one that is already accepted by the SAFENET standard. The
conformance of our solution to the FDDI and SAFENET standards is a highlight of this study. Next, we describe our
effort to extend these results to ATM networks.

2.4  Deadline  Guarantees  in  ATM  Networks 

ATM  is  one  of  the  candidate  technologies  under  consideration.  Although  the  proposed  ATM  standard  specifies  that 
ATM should be able to provide bounded delay services, how to best realize such a real-time capability has so far
been  an  open  problem.  We  aimed  at  addressing  this  problem  in  this  part  of  the  project.  Although  it  is  widely 
believed  that  ATM  can  support  real-time  traffic,  no  systematic  study  has  been  conducted  to  provide  quantitative 
analysis  and  design  guidelines  for  ATM  when  used  as  a  LAN  technology  for  real-time  communications. 

We  use  a  decomposition  method  to  derive  the  delay  bound  of  a  cell  in  an  ATM  network.  With  this  method,  the 
ATM network is decomposed into a set of servers. A server can be a link or a component of an ATM switch. For
example, an ATM switch can be decomposed into an input port server, a switching fabric server, and an output
port server; for a detailed description of the decomposition process, see [11, 20]. Delays of ATM cells at each
server are analyzed. The total delay of a cell from a particular connection is then the
sum of the delays at the servers the connection passes through.
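
The summation step of the decomposition method is simple enough to sketch directly. The Python fragment below is a minimal illustration; the server names and per-server delay bounds are made-up values, and how each per-server bound is derived is left to the analysis in the cited references.

```python
def end_to_end_delay_bound(path, server_delay_bound):
    """Sum the worst-case delay bounds of the servers a connection traverses.

    path: ordered list of server names along the connection
    server_delay_bound: dict server name -> worst-case cell delay at that server
    """
    return sum(server_delay_bound[server] for server in path)


# Hypothetical bounds (in microseconds) for one switch between two links.
bounds = {
    "link_1": 10,
    "input_port": 5,
    "switching_fabric": 2,
    "output_port": 40,
    "link_2": 10,
}
path = ["link_1", "input_port", "switching_fabric", "output_port", "link_2"]
print(end_to_end_delay_bound(path, bounds))  # 67
```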


To  successfully  apply  this  methodology,  we  must  have  a  way  to  efficiently  and  effectively  specify  the  traffic  in  the 
network.  This  is  because  in  order  to  obtain  the  delay  bound  at  a  server,  information  regarding  traffic  characteristics  at 
the server entrance must be known. We use the maximum rate function, $\Gamma(I)$, to specify traffic. Formally, $\Gamma(I)$ is
defined as the maximum data arrival rate of a connection at any point in the network during a time interval of length
$I$, i.e.,

$$\Gamma(I) = \max_{t} \left( \frac{\text{total number of bits arrived in } (t, t + I]}{I} \right)$$

Thus, $\Gamma(I)$ is an “umbrella” function that specifies an upper bound on the arrival rates that may appear in any
interval of length $I$. Unlike other traffic descriptors, which usually use a few numerical parameters, $\Gamma(I)$
provides a comprehensive picture of the worst case demand placed by a connection over any interval of time. With
the input traffic of a server modeled by $\Gamma(I)$, the cell delay at the server can be easily obtained [11]. Furthermore, the
characteristics of traffic at the output of a server can be derived in terms of $\Gamma(I)$ as well. This allows recursive analysis
of the successive servers.
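
As an illustration of the definition, the Python sketch below evaluates the maximum rate function for a finite arrival trace. The trace format (a list of `(time, bits)` pairs) and the function name are assumptions made for this example; in the analysis itself the function describes worst-case traffic rather than a measured trace.

```python
def max_rate(arrivals, interval):
    """Maximum arrival rate over any window (t, t + interval] of a trace.

    arrivals: list of (time, bits) pairs
    interval: window length I, must be positive
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    arrivals = sorted(arrivals)
    best = 0
    window_bits = 0
    start = 0
    # A worst-case window can always be shifted to end exactly at an arrival,
    # so it is enough to examine windows (end_time - interval, end_time].
    for end_time, bits in arrivals:
        window_bits += bits
        # The window is open on the left: drop arrivals at or before its start.
        while arrivals[start][0] <= end_time - interval:
            window_bits -= arrivals[start][1]
            start += 1
        best = max(best, window_bits)
    return best / interval


trace = [(0, 100), (1, 100), (2, 100), (10, 100)]
print(max_rate(trace, 2))  # 100.0 (200 bits in a window of length 2)
print(max_rate(trace, 4))  # 75.0  (300 bits in a window of length 4)
```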

A problem with using $\Gamma(I)$ as a traffic descriptor is that we must know the values of $\Gamma(I)$ for all values of the interval $I$.
Such a comprehensive description of the traffic requires a large amount of memory space, which may not be available
in practice. To address the problem, approximation methods have been proposed and analyzed [11,21]. In particular,
the point approximation method [11], in which the values of $\Gamma(I)$ at certain points are stored and the others are
estimated, was found to be useful. Comprehensive performance evaluation indicates that for a typical networking
environment, a 6-point approximation provides performance that is almost identical to that obtained by using the
exact $\Gamma(I)$ values and is much better than using simple bandwidth figures to model the traffic.
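
One way to estimate the rate function between stored points is sketched below in Python. It is not necessarily the estimator of [11]; it is an assumed, conservative rule based on the observation that the maximum number of bits in a window cannot decrease as the window grows, so the bits at the next larger stored interval bound the bits at any shorter interval. The function names and sample points are hypothetical.

```python
import bisect


def make_point_approximation(points):
    """Build an upper-bound estimator from stored (interval, rate) samples.

    points: list of (I_k, rate_k) pairs, where rate_k is the exact maximum
    rate for windows of length I_k.
    """
    points = sorted(points)
    lengths = [length for length, _ in points]

    def approx(interval):
        if interval <= 0:
            raise ValueError("interval must be positive")
        # Smallest stored interval that is >= the requested one.
        k = bisect.bisect_left(lengths, interval)
        if k == len(points):
            raise ValueError("interval beyond the largest stored point")
        stored_length, stored_rate = points[k]
        # Max bits in a window of length `interval` cannot exceed the max bits
        # in the longer stored window, i.e. stored_rate * stored_length.
        return stored_rate * stored_length / interval

    return approx


approx = make_point_approximation([(2, 100.0), (4, 75.0)])
print(approx(4))  # 75.0, exact at a stored point
print(approx(3))  # 100.0, upper bound derived from the stored point at I = 4
```

Because the estimate never falls below the true value, delay bounds computed from it remain safe, at the cost of some pessimism between stored points.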

In short, using an approximated $\Gamma(I)$ to describe the traffic of individual connections in the network, we have successfully
developed an efficient and effective method for deriving the delay bounds in the network. For the theoretical
background on the delay derivation with $\Gamma(I)$, please refer to [19,21]. For a discussion of the implementation of this delay
derivation procedure, please refer to [20,24,27]. Once again, our technology is compatible with existing products.
We have designed and implemented a software suite, called NetEx/ATM, that can provide delay-guaranteed
communication services for mission-critical applications over ATM networks. NetEx has been fully tested in our
distributed systems lab and is currently being extended to the heterogeneous networking domain.

5.3  Interactions

1.  With CISCO and Bay Networks, on using their source code for FDDI and ATM networks in order to
implement our methods for real-time communication over FDDI and ATM networks. Both agreed to
provide the source code as well as engineering support. The total support is worth the cash equivalent of more
than $100,000.

2.  The P.I. actively participated in the DoD NGCR program. Specifically, the P.I. presented the research results
to the SAFENET and HPN working groups and discussed with the members of the working groups how to
utilize them. As a result, SAFENET officially adopted one of the bandwidth allocation algorithms
invented by the P.I. and his students.


